# Install new packages (run once)
install.packages("nycflights13")
# Load packages
library(tidyverse) # ggplot2, dplyr, tidyr for data wrangling & visualization
library(nycflights13) # NYC flights data for dplyr examples
Statistics for Bioengineering - Week 2
Week 2: EDA, Probability, and Experimental Design
Week 2 Topics
- Exploratory data analysis
- Data visualization
- Parameters & statistics
- Probability distributions
- Estimates & confidence intervals
Readings: Chapters 9-15
Packages for This Week
Dataset from package: flights (336,776 flights departing NYC in 2013) - used for dplyr practice
Exploratory Data Analysis
What is EDA?
- First step in any data analysis
- Understand structure and patterns in your data
- Identify outliers, missing values, errors
- Generate hypotheses for formal testing
Data Wrangling with Tidyverse
Tidyverse Family of Packages
library(tidyverse)
A tibble - a tidyverse dataframe
Types of variables in a tibble
Key dplyr Verbs
dplyr is a package in the tidyverse that gives you functions (aka verbs) to wrangle data.
| Verb | Purpose | Example |
|---|---|---|
| `filter()` | Subset rows by values | `filter(df, x > 5)` |
| `select()` | Subset columns by name | `select(df, col1, col2)` |
| `arrange()` | Reorder rows | `arrange(df, x)` |
| `mutate()` | Create new columns | `mutate(df, z = x + y)` |
| `summarise()` | Collapse to summary | `summarise(df, mean = mean(x))` |
Filter, Arrange, and Select
# Filter rows
filter(flights, month == 11 | month == 12)
# Arrange rows
arrange(flights, year, month, day)
# Select columns
select(flights, year, month, day)
Conditional Operators
| Operator | Meaning |
|---|---|
| `==` | Equals exactly |
| `<`, `<=` | Less than (or equal) |
| `>`, `>=` | Greater than (or equal) |
| `!=` | Not equal to |
| `!` | NOT operator |
| `&` | AND operator |
| `\|` | OR operator |
| `%in%` | Belongs to set |
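As a minimal sketch of `%in%` (using a toy data frame, not the flights data): a chained OR condition and set membership select exactly the same rows.

```r
library(dplyr)

# Toy data frame standing in for flights (illustrative values)
df <- data.frame(month = rep(1:12, 2), delay = rnorm(24))

# Chained OR conditions and %in% keep the same rows
nov_dec_or <- filter(df, month == 11 | month == 12)
nov_dec_in <- filter(df, month %in% c(11, 12))
identical(nov_dec_or, nov_dec_in) # TRUE
```

`%in%` scales better when the set of allowed values is long or stored in a vector.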
Mutate and Transmute
# Add new columns based on existing ones
mutate(flights,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
# Using pipes
flights |>
mutate(
gain = arr_delay - dep_delay,
speed = distance / air_time * 60
)
Group By and Summarise
# Group by categorical variable
by_day <- group_by(flights, year, month, day)
# Summarise within groups
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
Aggregation functions return NA if any input value is NA. Use na.rm = TRUE to remove missing values.
Other Useful dplyr Functions
| Function | Purpose |
|---|---|
| `slice()` | Subset rows by position |
| `pull()` | Extract column as vector |
| `count()` | Count observations |
| `distinct()` | Unique observations |
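A quick sketch of these four verbs on a toy tibble (illustrative values; the same calls work on `flights`):

```r
library(dplyr)

# Toy tibble (illustrative values)
d <- tibble(carrier = c("AA", "AA", "UA"), dep_delay = c(4, -2, 10))

slice(d, 1:2)                  # subset rows by position
pull(d, dep_delay)             # extract a column as a plain vector
count(d, carrier, sort = TRUE) # observations per carrier, most frequent first
distinct(d, carrier)           # unique carrier values
```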
R Exercise: Data Wrangling
- Read in Week1b_Stickle_RNAseq.tsv or knee_injury.csv dataset
- Select a subset of categorical variables + quantitative variables
- Mutate to create square root transformed versions
- Summarise mean and SD grouped by categories (such as sex, population, treatment)
- Write results to a .csv file
Data Visualization with ggplot2
Introduction to ggplot2
- Part of the tidyverse suite of packages
- GG stands for “Grammar of Graphics”
- Start with ggplot(), supply dataset and aesthetic mapping with aes()
- Add geometry layers with geom_*() functions
More info: https://ggplot2.tidyverse.org/
Grammar of Graphics
Scatterplots
Scatterplots Code
ggplot(mpg, aes(displ, hwy, color = class)) +
geom_point(size = 4, alpha = 0.6)
Boxplots
Boxplots Code
ggplot(mpg, aes(manufacturer, hwy, colour = class)) +
geom_boxplot() +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust=1))
Common ggplot Geoms
| Geom | Purpose | Data Type |
|---|---|---|
| `geom_point()` | Scatterplots | Two continuous |
| `geom_line()` | Line plots | Continuous over ordered |
| `geom_bar()` | Bar charts | Categorical counts |
| `geom_histogram()` | Histograms | Single continuous |
| `geom_boxplot()` | Boxplots | Continuous by category |
| `geom_smooth()` | Trend lines | Two continuous |
Combining Geoms
Combining Geoms Code
ggplot(data=mpg, aes(x=displ, y=hwy)) +
geom_point() +
geom_smooth() +
labs(title = "Fuel efficiency decreases with engine size",
caption = "Data from fueleconomy.gov")
Faceting
Faceting Code
ggplot(data=mpg) +
geom_point(mapping=aes(x=displ, y=hwy)) +
facet_wrap(~class, nrow=2)
Three-Dimensional Data
Three-Dimensional Data Code
set.seed(345)
d <- data.frame(a = rnorm(100, 10, 10), b = rnorm(100, 5, 5))
ggplot(d, aes(x=a, y=b)) +
geom_density2d_filled() +
theme_minimal()
Choosing the Right Plot
R Exercise: Data Visualization
- Read in Week1b_Stickle_RNAseq.tsv or knee_injury.csv dataset
- Create histograms of data
- Create histograms on a faceted figure for different levels of a category
- Repeat steps 2 and 3 but create boxplots
- Write plots to a .pdf file
Best Practices in Data Visualization
Good, Bad, and Ugly
A line to no understanding
Wack a mole
Disk of disinformation
Steaming pie chart mess
A bake sale of pie charts
Principles of Effective Display
“Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space”
— Edward Tufte
- Show the data
- Encourage the eye to compare differences
- Represent magnitudes honestly and accurately
- Draw graphical elements clearly, minimizing clutter
- Make displays easy to interpret
“Above All Else Show the Data”
“Maximize the Data to Ink Ratio”
Represent Magnitudes Honestly
How NOT to Make a Figure
“Graphical excellence begins with telling the truth about the data” – Tufte 1983
Parameters and Statistics
Population vs. Sample
| Concept | Population | Sample |
|---|---|---|
| Definition | All individuals | Subset of population |
| Parameters | μ (mean), σ (SD) | x̄ (mean), s (SD) |
| Goal | What we want to know | What we can measure |
We use sample statistics to estimate population parameters
Accuracy vs. Precision
- Accuracy: closeness to true value
- Precision: closeness of repeated estimates to each other
- Replication quantifies variation
- Randomization avoids bias
Stochastic Processes in Statistics
- We often want to know truths about the world, but can only estimate them
- Uncertainty in those estimates is a given
- Statistics is largely about quantifying and managing uncertainty
- Random variables are products of stochastic processes
- Expectations are based on theoretical probability distributions
Two Interpretations of Probability
Frequency interpretation:
“Probabilities are mathematically convenient approximations to long run relative frequencies.”
Subjective (Bayesian) interpretation:
“A probability statement expresses the opinion of some individual regarding how certain an event is to occur.”
Random Variables and Probability
- Probability is the expression of belief in some future outcome
- A random variable can take on different values with different probabilities
- The sample space is the universe of all possible values
- Sample space represented by:
- Probability mass distribution (discrete)
- Probability density function (continuous)
Probabilities of a sample space always sum to 1.0
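This sum-to-one property is easy to check in R, for both a discrete PMF and a continuous PDF:

```r
# Discrete: a binomial PMF summed over its whole sample space equals 1
total_pmf <- sum(dbinom(0:10, size = 10, prob = 0.3))
total_pmf # 1

# Continuous: a normal PDF integrates to 1 over the real line
total_pdf <- integrate(dnorm, lower = -Inf, upper = Inf)$value
total_pdf # ~1
```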
Probability Rules
‘And’ rule (multiplication): \[Pr(X \text{ and } Y) = Pr(X) \times Pr(Y)\]
‘Or’ rule (addition): \[Pr(X \text{ or } Y) = Pr(X) + Pr(Y)\]
The ‘and’ rule assumes independent events, and the ‘or’ rule assumes mutually exclusive events. For non-independent events we need conditional probabilities, and for overlapping events we subtract the double-counted overlap: \(Pr(X \text{ or } Y) = Pr(X) + Pr(Y) - Pr(X \text{ and } Y)\).
Joint and Conditional Probability
Joint probability (independent events): \[Pr(X,Y) = Pr(X) \times Pr(Y)\]
Conditional probability (independent): \[Pr(Y|X) = Pr(Y) \text{ and } Pr(X|Y) = Pr(X)\]
Conditional probability (non-independent): \[Pr(Y|X) \neq Pr(Y) \text{ and } Pr(X|Y) \neq Pr(X)\]
Likelihood vs. Probability
- Probability: proportion of times an event would occur over many trials
- Likelihood: conditional probability of a parameter value given data
\[L[\text{parameter}|\text{data}] = Pr[\text{data}|\text{parameter}]\]
- Maximum likelihood: highest value of the likelihood function
- Bayesian estimate: uses prior distribution to update posterior distribution
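A minimal sketch of maximum likelihood, using invented data (7 successes in 10 Bernoulli trials): evaluate the binomial likelihood over a grid of candidate \(p\) values and take the peak.

```r
# Observed (illustrative) data: 7 successes in 10 trials
k <- 7; n <- 10

# L(p | data) = Pr(data | p), evaluated over a grid of candidate p values
p_grid <- seq(0.01, 0.99, by = 0.01)
lik <- dbinom(k, size = n, prob = p_grid)

# Maximum likelihood estimate: the p with the highest likelihood
p_hat <- p_grid[which.max(lik)]
p_hat # 0.7, matching the analytical MLE k/n
```

Plotting `lik` against `p_grid` shows the full likelihood function peaking at \(k/n\).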
Moments of Distributions
1st moment (mean/expectation): \[E[X] = \sum_{\text{all x}}xP(X=x) = \mu\]
2nd moment (variance): \[Var(X) = E[(X-\mu)^2] = \sigma^2\]
Standard deviation: \[SD = \sqrt{\sigma^2} = \sigma\]
Higher moments include skewness (3rd) and kurtosis (4th).
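The first two moments can be computed directly from the definitions. As a sketch, here is the mean and variance of a fair six-sided die:

```r
# Fair six-sided die: sample space and probabilities
x <- 1:6
p <- rep(1/6, 6)

mu     <- sum(x * p)             # 1st moment E[X] = 3.5
sigma2 <- sum((x - mu)^2 * p)    # 2nd central moment Var(X) = 35/12
c(mean = mu, var = sigma2, sd = sqrt(sigma2))
```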
Discrete Probability Distributions
PMF, PDF, and CDF in R
PMF, PDF, and CDF Code
par(mfrow = c(2, 2))
# Discrete (Poisson) PMF
x_disc <- 0:15
plot(x_disc, dpois(x_disc, lambda = 5), type = "h", lwd = 3, col = "steelblue",
main = "PMF: Poisson(λ=5)", xlab = "x", ylab = "P(X = x)")
points(x_disc, dpois(x_disc, lambda = 5), pch = 19, col = "steelblue")
# Continuous (Normal) PDF
x_cont <- seq(-4, 4, length.out = 200)
plot(x_cont, dnorm(x_cont), type = "l", lwd = 2, col = "coral",
main = "PDF: Normal(0, 1)", xlab = "x", ylab = "f(x)")
# Poisson CDF
plot(x_disc, ppois(x_disc, lambda = 5), type = "s", lwd = 2, col = "steelblue",
main = "CDF: Poisson(λ=5)", xlab = "x", ylab = "F(x)")
# Normal CDF
plot(x_cont, pnorm(x_cont), type = "l", lwd = 2, col = "forestgreen",
main = "CDF: Normal(0, 1)", xlab = "x", ylab = "F(x) = P(X ≤ x)")
abline(h = c(0.025, 0.975), lty = 2, col = "gray")
Common moments of distributions
Common Probability Distributions
A Bernoulli Trial
Probability of Independent Events
Bernoulli Distribution
Describes the expected outcome of a single event with probability \(p\).
Example: Flipping a fair coin once
\[Pr(X=\text{Head}) = \frac{1}{2} = 0.5 = p\]
\[Pr(X=\text{Tails}) = \frac{1}{2} = 0.5 = 1 - p = q\]
Probabilities always sum to 1: \(p + q = 1\)
Let’s Simulate Coin Flips
Coin Flip Code
coin <- c("heads", "tails")
flips <- sample(coin, prob = c(0.5, 0.5), size = 100, replace = TRUE)
barplot(table(flips), col = c("steelblue", "coral"))
In-Class Demo: Law of Large Numbers with Coin Flips
Watch how the proportion of heads converges to 0.5 as we flip more coins:
LLN Coin Flips Code
set.seed(344)
par(mfrow = c(2, 3))
# Different sample sizes
for(n in c(1, 8, 16, 32, 400, 1000)) {
flips <- rbinom(n = n, size = 1, prob = 0.5) # 0 = heads, 1 = tails
prop_heads <- sum(flips == 0) / length(flips)
prop_tails <- 1 - prop_heads
barplot(height = c(prop_heads, prop_tails),
names.arg = c("heads", "tails"),
col = c("steelblue", "coral"),
ylim = c(0, 1),
main = paste("n =", n, "flips"),
ylab = "Proportion")
abline(h = 0.5, lty = 2, col = "gray")
}
As sample size increases, observed proportions converge to true probability (0.5).
Law of Large Numbers in R
# Simulate cumulative means from a Normal(10, 2) population
set.seed(123)
n <- 1000
samples <- rnorm(n, mean = 10, sd = 2)
cumulative_means <- cumsum(samples) / 1:n
plot(1:n, cumulative_means, type = "l", col = "steelblue", lwd = 2,
xlab = "Sample Size (n)", ylab = "Cumulative Mean",
main = "Law of Large Numbers: Mean Converges to μ = 10")
abline(h = 10, col = "red", lwd = 2, lty = 2)
legend("topright", c("Cumulative Mean", "True Mean (μ = 10)"),
col = c("steelblue", "red"), lwd = 2, lty = c(1, 2))
Binomial Distribution
Results from combining several independent Bernoulli events.
\[\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}\]
Where:
- \(n\) = total number of trials
- \(k\) = number of successes
- \(p\) = probability of success
Binomial Distribution
Creating a Binomial Distribution
Binomial Distribution Code
# Probability of exactly 5 successes in 10 trials
dbinom(x = 5, size = 10, prob = 0.5)
# Plot distribution
plot(0:10, dbinom(x = 0:10, size = 10, prob = 0.5),
type = "h", lwd = 3, col = "steelblue",
xlab = "Number of Successes", ylab = "Probability")
Poisson Distribution
For discrete counts (e.g., snails per plot, neuron firings per second).
\[Pr(Y=r) = \frac{e^{-\lambda}\lambda^r}{r!}\]
The Poisson is a single-parameter distribution: \(\mu = \sigma^2 = \lambda\)
Variables with variance > mean are called “overdispersed” (common in RNA-seq data).
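The mean-equals-variance property, and what overdispersion looks like, can be checked by simulation (parameter values here are illustrative):

```r
set.seed(1)

# Poisson: mean and variance both equal lambda
y <- rpois(1e5, lambda = 4)
c(mean = mean(y), var = var(y)) # both close to 4

# Overdispersed counts (variance > mean), as in RNA-seq:
# negative binomial with var = mu + mu^2/size = 4 + 16/2 = 12
z <- rnbinom(1e5, mu = 4, size = 2)
c(mean = mean(z), var = var(z))
```

This is why RNA-seq packages model counts with the negative binomial rather than the Poisson.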
Poisson Distribution Examples
Creating Poisson Distributions
Poisson Distribution Code
# Probability of 2 counts given lambda = 1
dpois(x = 2, lambda = 1)
# Plot distribution
plot(0:15, dpois(x = 0:15, lambda = 3),
type = "h", lwd = 3, col = "darkgreen",
xlab = "Count", ylab = "Probability")
Geometric Distribution
Probability that the first success occurs on trial \(k\) (i.e., \(k-1\) failures, then a success):
\[P(X=k)=(1-p)^{k-1}p\]
- Mean = \(\frac{1}{p}\)
- Variance = \(\frac{(1-p)}{p^2}\)
Example: If extinction probability is 0.1 per year, expected time to extinction?
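The mean formula answers the extinction question directly, and a simulation check agrees (note that R's `rgeom()` counts failures before the first success, so add 1 to get the trial number):

```r
# Extinction probability p = 0.1 per year
p <- 0.1
1 / p # expected time to extinction: 10 years

# Simulation check: rgeom() counts failures BEFORE the first success,
# so add 1 to get the year in which extinction occurs
set.seed(7)
mean(rgeom(1e5, prob = p) + 1) # close to 10
```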
Negative Binomial Distribution
Probability of the \(r^{th}\) success on the \(k^{th}\) trial:
\[P(X=k)=\binom{k-1}{r-1}p^{r}(1-p)^{k-r}\]
- Mean = \(\frac{r}{p}\)
- Variance = \(\frac{r(1-p)}{p^2}\)
Continuous Probability Distributions
Continuous Probability Distributions
\[P(a\leq X \leq b) = \int_{a}^{b} f(x) dx\]
The density integrates to one over the entire real line:
\[\int_{-\infty}^{\infty} f(x) dx = 1\]
Expectation: \[E[X] = \int_{-\infty}^{\infty} xf(x) dx\]
Uniform Distribution
All outcomes equally probable.
\[E[X] = \frac{(a+b)}{2}\]
Uniform Distribution Code
x <- seq(0, 10, 0.1)
plot(x, dunif(x, 0, 10), type = "l", lwd = 2, col = "purple",
ylab = "Density", main = "Uniform Distribution (0, 10)")
Exponential Distribution
\[f(x)=\lambda e^{-\lambda x}\]
- \(E[X] = \frac{1}{\lambda}\)
- \(Var(X) = \frac{1}{\lambda^2}\)
Example: If λ equals the instantaneous death rate, the lifespan follows an exponential distribution.
Exponential Distribution
Exponential Distribution Code
x <- seq(0, 50, 0.5)
plot(x, dexp(x, rate = 0.1), type = "l", lwd = 2, col = "red",
ylab = "Density", main = "Exponential Distribution (λ = 0.1)")
Gamma Distribution
Waiting time until the \(r^{th}\) event at rate \(\lambda\):
\[f(x) = \frac{\lambda^r x^{r-1} e^{-\lambda x}}{(r-1)!}\]
- Mean = \(\frac{r}{\lambda}\)
- Variance = \(\frac{r}{\lambda^2}\)
Example: Time until 1000 DNA strands synthesized at rate 1/ms.
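Working that example through with the mean and variance formulas (r = 1000, λ = 1 per ms), plus a tail probability from `pgamma()`:

```r
# Waiting time until the r-th event at rate lambda
r <- 1000; lambda <- 1   # 1000 strands, rate 1 per ms

r / lambda               # expected waiting time: 1000 ms
sqrt(r / lambda^2)       # SD of waiting time: ~31.6 ms

# Probability the 1000th strand takes longer than 1050 ms
pgamma(1050, shape = r, rate = lambda, lower.tail = FALSE)
```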
The Normal or Gaussian Distribution
Normal (Gaussian) Distribution
\[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, \mathrm{e}^{-\frac{(x - \mu)^2}{2\sigma^2}}\]
Notation: \(X \sim \mathcal{N}(\mu, \sigma^2)\)
Normal Distribution
Normal Distribution
Why is the Normal Distribution Special?
Why Normal is Special in Biology
Why Normal is Special in Biology
Z-Scores of Normal distributions
Standardize variables to mean = 0, SD = 1:
\[\huge z_i = \frac{(x_i - \bar{x})}{s}\]
This is the standard normal distribution.
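A minimal sketch of standardization (simulated values standing in for real measurements): the manual formula and R's built-in `scale()` give the same z-scores, with mean 0 and SD 1.

```r
set.seed(11)
x <- rnorm(50, mean = 120, sd = 15) # e.g., simulated blood pressure readings

# Manual z-scores
z <- (x - mean(x)) / sd(x)

# scale() does the same centering and scaling
all.equal(z, as.numeric(scale(x))) # TRUE

c(mean = mean(z), sd = sd(z)) # mean ~0, SD 1
```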
Sampling from distributions in R
Distribution Functions in R
| Prefix | Function | Example |
|---|---|---|
| `d` | Probability density/mass | `dnorm(0, 0, 1)` |
| `p` | Cumulative distribution | `pnorm(1.96, 0, 1)` |
| `q` | Quantile function | `qnorm(0.975, 0, 1)` |
| `r` | Random number generator | `rnorm(100, 0, 1)` |
Works for: binom, pois, exp, norm, geom, nbinom, unif, gamma
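A quick consistency check on the standard normal ties the four prefixes together: `p` and `q` are inverses, and `r` draws should roughly reproduce the theoretical quantile.

```r
# q and p are inverse functions
q975 <- qnorm(0.975, mean = 0, sd = 1) # ~1.96
pnorm(q975, mean = 0, sd = 1)          # back to 0.975

# Random draws roughly match the theoretical quantile
set.seed(3)
quantile(rnorm(1e5), 0.975) # close to 1.96
```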
Estimates and Confidence Intervals
Parameter Estimation
- Estimation infers population parameters from sample data
- Sample estimates rarely equal population parameters exactly
- Sampling distribution: all values we might obtain from samples
- Standard error: standard deviation of sampling distribution
NO ESTIMATE IS USEFUL WITHOUT A STANDARD ERROR!
Estimation Approaches
| Approach | Description |
|---|---|
| Parametric | Assumes specific distributions |
| Resampling | Bootstrap/randomization for empirical distributions |
| OLS | Ordinary Least Squares optimization |
| Maximum Likelihood | Model-based estimates with confidence |
| Bayesian | Incorporates prior information |
The Central Limit Theorem
For most data, we can’t determine sampling distributions empirically.
The CLT tells us that the sampling distribution of the mean approaches normal as sample size increases, regardless of the underlying distribution.
Visualizing the Central Limit Theorem
Visualizing the Central Limit Theorem
par(mfrow = c(2, 4))
# Exponential distribution (skewed)
set.seed(42)
exp_pop <- rexp(10000, rate = 1)
hist(exp_pop, main = "Population\n(Exponential)", col = "lightgray",
border = "white", xlab = "Value", breaks = 30)
# Sample means for different n
for(n in c(5, 30, 100)) {
means <- replicate(1000, mean(sample(exp_pop, n, replace = TRUE)))
hist(means, main = paste("Sample Means\n(n =", n, ")"),
col = "steelblue", border = "white", xlab = "Mean", breaks = 25)
}
# Uniform distribution
unif_pop <- runif(10000, 0, 1)
hist(unif_pop, main = "Population\n(Uniform)", col = "lightgray",
border = "white", xlab = "Value", breaks = 30)
for(n in c(5, 30, 100)) {
means <- replicate(1000, mean(sample(unif_pop, n, replace = TRUE)))
hist(means, main = paste("Sample Means\n(n =", n, ")"),
col = "coral", border = "white", xlab = "Mean", breaks = 25)
}
CLT in Quantitative Genetics
Why are complex traits normally distributed?
Simulate a trait controlled by 5 genes, each with additive effects:
CLT Quantitative Genetics Code
set.seed(245)
n_individuals <- 100
# Each locus contributes -1, 0, or +1 to phenotype (diploid genotypes)
contributions <- matrix(0, nrow = n_individuals, ncol = 5)
for(locus in 1:5) {
genotype <- rbinom(n_individuals, size = 2, prob = 0.5) # 0, 1, or 2 copies
contributions[, locus] <- ifelse(genotype == 2, 1,
ifelse(genotype == 1, 0, -1))
}
# Sum across loci to get phenotype
phenotype <- rowSums(contributions)
par(mfrow = c(1, 2))
hist(phenotype, col = "steelblue", border = "white",
main = "5-Locus Trait Distribution",
xlab = "Phenotype Value", breaks = seq(-6, 6, 1))
# Add environmental noise to make continuous
phenotype_continuous <- phenotype + rnorm(n_individuals, 0, 0.5)
hist(phenotype_continuous, col = "coral", border = "white",
main = "With Environmental Variation",
xlab = "Phenotype Value", breaks = 20)
This is why quantitative traits (height, weight, blood pressure) are approximately normal!
Sampling variation of a parameter estimate
Sampling Variation of a Parameter
Estimation and Confidence Intervals
Standard Error of the Mean
\[\huge \sigma_{\bar{x}} \approx s_{\bar{x}} = \frac{s}{\sqrt{n}}\]
- SEM is NOT the standard deviation of the original distribution
- SEM decreases as sample size increases
Calculating SEM
set.seed(32)
true_pop <- rnorm(n = 1000, mean = 2, sd = 5)
# Small samples (n = 5): the SD of the 50 sample means IS the empirical SEM
samps_5 <- replicate(n = 50, sample(true_pop, size = 5))
sem_5 <- sd(apply(samps_5, 2, mean))
# Larger samples (n = 50)
samps_50 <- replicate(n = 50, sample(true_pop, size = 50))
sem_50 <- sd(apply(samps_50, 2, mean))
cat("SEM (n=5):", round(sem_5, 4), "\nSEM (n=50):", round(sem_50, 4))
The empirical SEMs track the theoretical values \(s/\sqrt{n}\) (about 2.24 for n = 5 and 0.71 for n = 50): the SEM shrinks by roughly \(\sqrt{10}\) when n grows tenfold.
Confidence Intervals
A confidence interval is a range of values about a parameter estimate such that we are X% certain the true population parameter lies within.
# 95% CI using t.test
sample_data <- rnorm(30, mean = 10, sd = 2)
t.test(sample_data)$conf.int
[1] 9.177721 10.707722
attr(,"conf.level")
[1] 0.95
Visualizing What “95% Confidence” Means
Visualizing What “95% Confidence” Means
true_mean <- 10
n_samples <- 50
# Take 50 samples and calculate CIs
results <- t(replicate(n_samples, {
samp <- rnorm(20, mean = true_mean, sd = 2)
ci <- t.test(samp)$conf.int
c(mean = mean(samp), lower = ci[1], upper = ci[2])
}))
# Plot CIs - color red if they miss the true mean
contains_true <- results[, "lower"] <= true_mean & results[, "upper"] >= true_mean
colors <- ifelse(contains_true, "steelblue", "red")
plot(NULL, xlim = c(8, 12), ylim = c(0, n_samples + 1),
xlab = "Value", ylab = "Sample Number", main = "50 Confidence Intervals")
abline(v = true_mean, col = "darkgreen", lwd = 2)
for(i in 1:n_samples) {
segments(results[i, "lower"], i, results[i, "upper"], i, col = colors[i], lwd = 2)
points(results[i, "mean"], i, pch = 19, cex = 0.6, col = colors[i])
}
legend("topright", c(paste(sum(contains_true), "contain μ"),
paste(sum(!contains_true), "miss μ")), col = c("steelblue", "red"), lwd = 2)
Coefficient of Variation
To compare standard deviations across populations with different means:
\[CV = \frac{s}{\bar{x}} \times 100\%\]
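As a sketch with invented numbers: body mass in mice and humans has wildly different SDs in grams, but the CV puts their relative variability on the same scale.

```r
# Illustrative (made-up) values: mass in grams
mouse <- c(mean = 25,    sd = 4)
human <- c(mean = 70000, sd = 9000)

cv <- function(s, xbar) s / xbar * 100

cv(mouse["sd"], mouse["mean"]) # 16%
cv(human["sd"], human["mean"]) # ~12.9%
```

Despite a 2000-fold difference in SD, the two populations are comparably variable relative to their means.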